Link to video presentation : https://youtu.be/feObLhJLg7o
Flooding is the most common natural disaster for cities located near water. With flood frequency and severity increasing due to factors such as climate change and urbanization, there is a growing need for accurate flood inundation probability maps to help communities prepare for and mitigate impacts.
The project aims to build a predictive flood inundation model that estimates the probability of an area being flooded. The model will be trained and validated on data from Calgary, Alberta (Canada) and then used to predict flooding probability in Denver, a city with similar hydrologic and geographic features. The project uses both ArcGIS and R software: ArcGIS to process and analyze spatial data, and R for statistical analysis and machine learning. The resulting flood inundation probability maps will indicate the likelihood of flooding across Denver and provide a reference flood inundation prediction model for other cities.
To get started, we need to install and load the libraries we need. We
will use a set of ggplot styles we call mapTheme and
plotTheme.
library(plotROC)
library(tidyverse)
library(sf)
library(ggplot2)
library(spdep)
library(caTools)
library(caret)
library(pROC)
library(viridis)
library(gridExtra)
library(cowplot)
library(patchwork)
mapTheme <- theme(plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
panel.background=element_blank(),
panel.border=element_blank(),
panel.grid.major=element_line(colour = 'transparent'),
panel.grid.minor=element_blank(),
legend.direction = "vertical",
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))
plotTheme <- theme(
plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 10),
# Set the entire chart region to blank
panel.background=element_blank(),
plot.background=element_blank(),
#panel.border=element_rect(colour="#F0F0F0"),
# Format the grid
panel.grid.major=element_line(colour="#D0D0D0", linewidth=.75),
axis.ticks=element_blank())
ArcGIS Pro was employed to generate predictive features from terrain and impervious-surface raster data: slope, distance to streams with drainage areas greater than 50 km2 and 100 km2, flow accumulation, and impervious percentage of surface. The target variable was obtained from Calgary open data: the percentage of each cell's area falling within the mapped inundation zones (the 100-year, 1% zone and the 10-year, 10% zone). A similar process was employed to generate predictive features for Denver, with the exception that inundation data was not available there.
predictors <- c(
"boundary", # ---------- 1: inside city boundary; 0: outside city boundary
"dem", # --------------- From DEM; meters
"slope", # ------------- Percentage rise
"dist_big_streams", # -- Distance (meters) to streams with drainage > 50 km2
"dist_huge_streams", # - Distance (meters) to streams with drainage > 100 km2
"flow_accumulation", # - Flow accumulation (number of cells)
"impervious" # --------- Impervious surface as percentage of area
)
targets <- c(
"inundation_1pct", # --- Percentage area in 100-year (1%) inundated zone
"inundation_10pct" # --- Percentage area in 10-year (10%) inundated zone
)
The next step involved generating fishnet data for both cities, which was then used to aggregate the raster feature data. Zonal statistics were applied to compute the average values of the raster features at a 30 by 30-meter cell size. Next, a self-defined function was used to loop through the features and join them into separate geo-dataframes for Calgary and Denver, respectively.
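Though the project's fishnets were built in ArcGIS Pro, a comparable grid can be sketched in R with sf's st_make_grid. The bounding box and extent below are purely illustrative, not project values:

```r
library(sf)

# Illustrative only: a small square fishnet over an arbitrary bounding box.
# The project's fishnets were generated in ArcGIS Pro over the city extents.
bbox <- st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 300, ymax = 300)))
grid <- st_make_grid(bbox, cellsize = 30, square = TRUE)  # 30 m cells
fishnet <- st_sf(id = seq_along(grid), geometry = grid)
nrow(fishnet)  # 100 cells
```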
add_variables_from_csv <- function(city_fishnet, city_name, variable_list) {
for (variable in variable_list) {
path <- paste0(
"~/Documents/GitHub/flood-inundation-map/data/fishnet-output/tbl_", city_name, "_", variable, ".csv"
)
data <- read.csv(path) %>%
rename(id = OID_, !!variable := value)
city_fishnet <- city_fishnet %>%
left_join(data, by = "id")
}
city_fishnet <- city_fishnet %>%
filter(boundary > 0.9) %>%
dplyr::select(-boundary)
return(city_fishnet)
}
calgary <- st_read("~/Documents/GitHub/flood-inundation-map/data/fishnet-output/calgary_fishnet.shp") %>%
dplyr::select(id, geometry) %>%
add_variables_from_csv("calgary", c(predictors, targets))
denver <- st_read("~/Documents/GitHub/flood-inundation-map/data/fishnet-output/denver_fishnet.shp") %>%
dplyr::select(id, geometry) %>%
add_variables_from_csv("denver", predictors)
During the data exploration and manipulation stage, histograms were
plotted to examine the distribution of the features. It was observed
that the distributions of some features such as
dist_big_streams, dist_huge_streams, and
flow_accumulation were skewed, posing problems for
analysis. To address this, a logarithmic transformation was applied to
these features to normalize their distribution.
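The effect of the transformation can be illustrated with a small base-R sketch on simulated right-skewed data (toy numbers, not project data). Note that log1p() is a common alternative when a variable such as flow accumulation can contain zeros, since log(0) is -Inf:

```r
set.seed(42)
skewed <- rexp(1000, rate = 1 / 500)  # simulated right-skewed "distances"
logged <- log1p(skewed)               # log1p(x) = log(1 + x), safe at zero

# Simple moment-based skewness: large and positive for the raw data,
# much smaller in magnitude after the log transform
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(raw = skewness(skewed), logged = skewness(logged))
```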
Below are the histograms of the Calgary dataset.
## Function to plot histograms of the variables
plot_histograms <- function(data, predictors) {
# Initialize an empty list to store the plots
plots <- list()
for (predictor in predictors) {
# Create a histogram for each predictor in the dataset
p <- ggplot(data, aes(.data[[predictor]])) +
geom_histogram(fill = "grey") +
labs(title = paste("Histogram of", "\n", predictor),
x = predictor,
y = "Frequency") +
theme_minimal() +
theme(plot.title = element_text(size = 12))
# Add the histogram to the list
plots <- append(plots, list(p))
}
# Arrange the plots in a grid with 3 columns and 2 rows
grid.arrange(grobs = plots, ncol = 3, nrow = 2)
}
# Call the function (dropping the boundary indicator, which is not plotted)
plot_histograms(data = calgary, predictors = setdiff(predictors, "boundary"))
Below are the histograms of the Denver dataset.
plot_histograms(denver, setdiff(predictors, "boundary"))
# Function to engineer features
# for calgary
engineer_features_calgary <- function(data) {
data <- data %>%
mutate(
log_dist_big_streams = log(dist_big_streams),
log_dist_huge_streams = log(dist_huge_streams),
log_flow_accumulation = log(flow_accumulation),
inundated = ifelse(inundation_1pct > 0.5, 1, 0)
)
return(data)
}
# for denver
engineer_features_denver <- function(data) {
data <- data %>%
mutate(
log_dist_big_streams = log(dist_big_streams),
log_dist_huge_streams = log(dist_huge_streams),
log_flow_accumulation = log(flow_accumulation)
# inundated is not created here: Denver has no observed inundation data
)
return(data)
}
The target variable inundated was binarized from the
original variable inundation_1pct, the percentage of cell
area in the 100-year (1%) inundated zone, using an ifelse statement: if
inundation_1pct was greater than 0.5, inundated was
assigned a value of 1 (positive for inundation); otherwise it was
assigned a value of 0.
# Get the predictors and target
predictors_used <- c(
"dem", # --------------- From DEM; meters
"slope", # ------------- Percentage rise
"log_dist_big_streams",
"log_dist_huge_streams",
"log_flow_accumulation",
"impervious" # --
)
lag_predictors_used <- paste0("lag_", predictors_used)
all_predictors_used <- c(predictors_used, lag_predictors_used)
target_used <- c("inundated")
calgary_used <- engineer_features_calgary(calgary) %>%
dplyr::select(all_of(c("id", predictors_used, target_used)))
denver_used <- engineer_features_denver(denver) %>%
dplyr::select(all_of(c("id", predictors_used)))
In the context of spatial analysis, it is important to account for
spatial autocorrelation, which is the tendency for the values of a
variable to be more similar at nearby locations than at distant ones.
The calculate_spatial_lags function is used to address this
issue by generating spatial lag variables, which are the weighted
averages of the values of a variable in the neighboring locations.
Incorporating spatial lags into our model helps capture the spatial
patterns and dependencies within the data, leading to more accurate and
reliable predictions.
# add predictor spatial lags
calculate_spatial_lags <-
function(fishnet, predictors, id_col = "id", geometry_col = "geometry") {
# Create neighbors list using the 'geometry' column
nb <- poly2nb(fishnet, row.names = fishnet[[id_col]])
# Create spatial weights matrix
swm <- nb2listw(nb, style = "W", zero.policy = TRUE)
# Calculate spatial lags for the specified predictor variables
for (predictor in predictors) {
spatial_lag_colname <- paste0("lag_", predictor)
predictor_values <- as.numeric(fishnet[[predictor]])
fishnet[[spatial_lag_colname]] <- lag.listw(swm, predictor_values, zero.policy = TRUE)
}
return(fishnet)
}
calgary_used <- calgary_used %>%
calculate_spatial_lags(predictors_used)
denver_used <- denver_used %>%
calculate_spatial_lags(predictors_used)
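A toy example (three cells in a row, not project data) shows what a row-standardized spatial lag computes: each cell receives the average of its neighbours' values, which is what lag.listw returns under style = "W":

```r
# Three cells in a row: cell 2 neighbours cells 1 and 3;
# cells 1 and 3 each neighbour only cell 2.
W <- matrix(c(0, 1, 0,
              1, 0, 1,
              0, 1, 0), nrow = 3, byrow = TRUE)
W <- W / rowSums(W)          # row-standardise, as nb2listw(style = "W") does
x <- c(10, 20, 40)           # a variable observed in each cell
lag_x <- as.vector(W %*% x)  # weighted average of neighbouring values
lag_x                        # 20, 25, 20
```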
To build the model, relevant features and the target variable were selected, and then the data was randomly split into training (75%) and testing (25%) sets.
calgary_for_model <- calgary_used %>%
dplyr::select(all_of(c(all_predictors_used, target_used)), id)
train_ratio <- 0.75
sample <- sample.split(calgary_for_model$inundated, SplitRatio = train_ratio)
train <- subset(calgary_for_model, sample == TRUE)
test <- subset(calgary_for_model, sample == FALSE)
The training set was used to fit a logistic model with the
glm() function, specifying the target variable
inundated and all other columns as predictors. The predict function was
then used to generate predicted probabilities for the test
dataset, which were stored in the predicted_probs
column.
model1 <- glm(
inundated ~ ., train %>% dplyr::select(-id) %>% st_drop_geometry(),
family = binomial(link = "logit")
)
test$predicted_probs <- predict(model1, test, type = "response")
The summary output of the training logistic model provides the
results of the logistic regression model. The summary shows which
features are statistically significant in predicting the target variable
in Calgary. In this case, the features dem,
slope, log_flow_accumulation,
lag_dem, lag_log_dist_big_streams,
lag_log_dist_huge_streams, and lag_impervious
were found to be statistically significant in predicting the target
variable. This means that these features have a significant impact on
the likelihood of an area being inundated.
summary(model1)
##
## Call:
## glm(formula = inundated ~ ., family = binomial(link = "logit"),
## data = train %>% dplyr::select(-id) %>% st_drop_geometry())
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1943 -0.0357 -0.0054 -0.0003 3.3317
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 68.13000 5.98079 11.391 < 2e-16 ***
## dem -0.30015 0.04847 -6.192 5.93e-10 ***
## slope -0.15231 0.04625 -3.293 0.00099 ***
## log_dist_big_streams -3.03248 1.78000 -1.704 0.08845 .
## log_dist_huge_streams 3.02408 1.79683 1.683 0.09237 .
## log_flow_accumulation 0.10014 0.05063 1.978 0.04793 *
## impervious 0.99314 0.69948 1.420 0.15566
## lag_dem 0.24464 0.04798 5.099 3.41e-07 ***
## lag_slope -0.01791 0.08954 -0.200 0.84144
## lag_log_dist_big_streams 9.04560 2.94922 3.067 0.00216 **
## lag_log_dist_huge_streams -11.78866 2.97770 -3.959 7.53e-05 ***
## lag_log_flow_accumulation 0.17812 0.11277 1.580 0.11422
## lag_impervious 6.39102 1.02641 6.227 4.77e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2165.58 on 6438 degrees of freedom
## Residual deviance: 724.03 on 6426 degrees of freedom
## AIC: 750.03
##
## Number of Fisher Scoring iterations: 10
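Because logistic regression coefficients are reported on the log-odds scale, exponentiating them yields odds ratios, which are often easier to interpret. A toy fit on simulated data (not the project model) illustrates the idea:

```r
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-1 + 2 * x))  # simulate a binary outcome
fit <- glm(y ~ x, family = binomial(link = "logit"))

exp(coef(fit))  # odds ratios: multiplicative change in odds per unit of x
```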
After identifying the statistically significant features, maps of those features were plotted to visualize their spatial distribution across Calgary.
# Create individual plots
plot1 <- ggplot() +
geom_sf(data=calgary_for_model, aes(fill=dem)) +
scale_fill_viridis() +
labs(title="DEM in meters") + mapTheme
plot2 <- ggplot() +
geom_sf(data=calgary_for_model, aes(fill=slope)) +
scale_fill_viridis() +
labs(title="Slope, percentage rise") + mapTheme
combined_plot <- plot_grid(plot1, plot2,ncol=2)
print(combined_plot)
plot3 <- ggplot() +
geom_sf(data=calgary_for_model, aes(fill=log_flow_accumulation)) +
scale_fill_viridis(name="") +
labs(title="Log flow accumulation") +
mapTheme
plot4 <- ggplot() +
geom_sf(data=calgary_for_model, aes(fill=lag_log_dist_big_streams)) +
scale_fill_viridis(name="") +
labs(title=paste0("Spatial lag log", "\nDistance (meters) to streams with drainage > 50 km2")) +
mapTheme
combined_plot <- plot_grid(plot3,plot4,ncol=2)
print(combined_plot)
plot5 <- ggplot() +
geom_sf(data=calgary_for_model, aes(fill=lag_log_dist_huge_streams)) +
scale_fill_viridis(name="") +
labs(title=paste0("Spatial lag log", "\nDistance (meters) to streams with drainage > 100 km2")) + mapTheme
plot6 <- ggplot() +
geom_sf(data=calgary_for_model, aes(fill=lag_impervious)) +
scale_fill_viridis(name="") +
labs(title=paste0("Spatial lag", "\nImpervious surface as percentage of area")) + mapTheme
combined_plot <- plot_grid(plot5, plot6,ncol=2)
print(combined_plot)
The summary output only reports the overall model AIC and which features are statistically significant; by itself, it does not evaluate the model's error and accuracy.
ggplot(test, aes(x = predicted_probs, fill = as.factor(inundated))) +
geom_density() +
facet_grid(inundated ~ ., scales = "free") +
xlim(-0.1, 1) +
labs(
x = "Predicted Probability of Inundation",
y = "Probability Density",
title = "Distribution of predicted probabilities by observed outcome"
)+
scale_fill_manual(values = c("dark blue", "dark red"),
labels = c("Not Inundated","Inundated"),
name = "")+
geom_vline(xintercept = .5)
To take a closer look, probability density plots were generated to
show the distribution of predicted probabilities of inundation for the
test set. The plots are arranged in a grid, with one plot for each value
of the inundated variable. The blue plot represents areas that are not
inundated, while the red plot represents areas that are inundated. The
vertical line represents a 0.5 probability of inundation. If the predicted
probability is below 0.5, the predicted class is 0 (not
inundated), and vice versa.
From the plot, we can observe that the False Negative rate is rather
high. This indicates that there are cases where the actual class is
1 (inundated), but the model incorrectly assigns a
predicted class of 0 (not inundated) due to the predicted
probability being below 0.5. This suggests that the model may not be
accurately identifying areas that are likely to be inundated.
Although the False Negative rate is high, the plot reveals a high True Positive rate. This means that the model is correctly predicting areas that are likely to be inundated, with a high probability. The high True Positive rate is encouraging, as it indicates that the model has some predictive power and can be useful in identifying areas that are vulnerable to flooding. However, it is still important to address the False Negative rate to improve the model’s accuracy and reliability.
The plot below shows the ROC curve and AUC for the model. The AUC value of 0.99 indicates that the model has a very high degree of accuracy in predicting whether an area will be inundated or not. However, such a high AUC may also be a sign of overfitting, which would hurt performance on new datasets.
roc_data <- data.frame(
D = as.logical(test$inundated),
M = test$predicted_probs
)
ggplot(roc_data, aes(d = D, m = M)) +
geom_roc() +
geom_abline(slope = 1, intercept = 0, linewidth = 1.5, color = "grey") +
labs(
title = "ROC Curve",
subtitle = paste("Area Under Curve (AUC):", round(auc(pROC::roc(test$inundated, test$predicted_probs)), 4)),
x = "False Positive Rate (FPR)",
y = "True Positive Rate (TPR)"
)
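The AUC can also be computed without pROC, since it equals the Mann-Whitney rank statistic: the probability that a randomly chosen inundated cell receives a higher predicted probability than a randomly chosen non-inundated cell. A base-R sketch on toy values (not model output):

```r
auc_rank <- function(labels, scores) {
  r <- rank(scores)            # midranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_rank(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))  # 0.75
```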
To generate predicted inundation for the whole city, the entire Calgary
dataset calgary_for_model was used to fit a logistic
regression model, where the target variable is inundated and all other
variables are predictors. After training the model, the
predict function is used to generate predicted
probabilities for the entire Calgary dataset. The predicted
probabilities are stored in the
calgary_for_model$predicted_probs column.
model2 <- glm(
inundated ~ ., calgary_for_model %>% dplyr::select(-id) %>% st_drop_geometry(),
family = binomial(link = "logit")
)
summary(model2)
##
## Call:
## glm(formula = inundated ~ ., family = binomial(link = "logit"),
## data = calgary_for_model %>% dplyr::select(-id) %>% st_drop_geometry())
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6589 -0.0438 -0.0076 -0.0005 3.3370
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 64.48838 4.92173 13.103 < 2e-16 ***
## dem -0.28848 0.04039 -7.142 9.17e-13 ***
## slope -0.14460 0.03919 -3.690 0.000224 ***
## log_dist_big_streams -1.61881 1.48371 -1.091 0.275250
## log_dist_huge_streams 1.74726 1.49393 1.170 0.242173
## log_flow_accumulation 0.11699 0.04310 2.714 0.006638 **
## impervious 0.84887 0.57019 1.489 0.136553
## lag_dem 0.23417 0.04006 5.845 5.05e-09 ***
## lag_slope -0.01413 0.07558 -0.187 0.851721
## lag_log_dist_big_streams 6.63901 2.18384 3.040 0.002365 **
## lag_log_dist_huge_streams -9.19426 2.19720 -4.185 2.86e-05 ***
## lag_log_flow_accumulation 0.23751 0.09360 2.537 0.011168 *
## lag_impervious 6.53214 0.84399 7.740 9.97e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2887.41 on 8584 degrees of freedom
## Residual deviance: 996.34 on 8572 degrees of freedom
## AIC: 1022.3
##
## Number of Fisher Scoring iterations: 10
calgary_for_model$predicted_probs <- predict(model2, calgary_for_model, type = "response")
To map the predicted inundation with error and accuracy, a confusion
matrix map was plotted. The mutate function is used to create a new
column called confResult, which categorizes the predicted
and actual classes into four possible outcomes: True Negative, True
Positive, False Negative, and False Positive.
calgary_for_model %>%
mutate(confResult=case_when(predicted_probs < 0.5 & inundated==0 ~ "True_Negative",
predicted_probs >= 0.5 & inundated==1 ~ "True_Positive",
predicted_probs < 0.5 & inundated==1 ~ "False_Negative",
predicted_probs >= 0.5 & inundated==0 ~ "False_Positive")) %>%
ggplot()+
geom_sf(aes(fill = confResult), color = "transparent")+
scale_fill_manual(values = c("Red","Orange","Light Blue","Light Green"),
name="Outcomes")+
labs(title="Confusion Matrix") +
mapTheme
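The case_when logic above can be checked on toy values with base R (the probabilities below are made up, not model output):

```r
predicted_probs <- c(0.1, 0.7, 0.4, 0.9)  # toy predicted probabilities
inundated       <- c(0,   1,   1,   0)    # toy observed classes
pred_class <- as.integer(predicted_probs >= 0.5)

confResult <- ifelse(pred_class == 1 & inundated == 1, "True_Positive",
              ifelse(pred_class == 0 & inundated == 0, "True_Negative",
              ifelse(pred_class == 0 & inundated == 1, "False_Negative",
                                                       "False_Positive")))
table(confResult)  # one of each outcome in this toy example
```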
Below is the predicted inundation map for Calgary, produced by the model trained on the entire Calgary dataset.
ggplot() +
geom_sf(data=calgary_for_model, aes(fill=predicted_probs)) +
scale_fill_gradient(low = "blue", high = "red", name = "Predicted probability") +
labs(title="Predicted inundation") + mapTheme
Then the logistic regression model trained on the entire Calgary dataset
was used to predict inundation probabilities for the Denver dataset. The
predict function is used to generate predicted probabilities for the
Denver dataset using the model object created previously. The predicted
probabilities are then stored in the predicted_probs column
of the denver_used dataset and mapped using
ggplot. This step allows us to apply the model to new data
and evaluate how well it generalizes to other areas.
Though the predicted probabilities for Denver were comparatively small, the predicted classes were still distinguishable in the resulting map.
denver_used$predicted_probs <- predict(model2, denver_used, type = "response")
ggplot() +
geom_sf(data=denver_used, aes(fill=predicted_probs)) +
scale_fill_gradient(low = "blue", high = "red",name = "Predicted probability") +
labs(title="Predicted inundation") + mapTheme
This project aimed to estimate flood inundation probabilities for the cities of Calgary and Denver using a predictive model. A logistic regression model was evaluated with a confusion matrix and an ROC curve, which indicated an AUC of 0.99 and overall good performance.
However, the predicted probabilities for Denver were comparatively small, indicating a limitation in the model’s ability to generalize to new areas. Furthermore, the AUC is unusually high, which might indicate overfitting. For future improvement, one possibility is to validate the model on additional cities, which could make it more rigorous and accurate in predicting flood inundation probabilities for new locations.